ABSTRACT 

A  framework  for  object  recognition  via  combinations  of  nonrigid  deformable  appearance  models  is  described.  An 
object  category  is  represented  as  a  combination  of  deformed  prototypical  images.  An  object  in  an  image  can  be 
represented  in  terms  of  its  geometry  (shape)  and  its  texture  (visual  appearance).  We  employ  finite  element  based 
methods  to  represent  the  shape  deformations  more  reliably  and  automatically  register  the  object  images  by  warping 
them  onto  the  underlying  finite  element  mesh  for  each  prototype  shape.  Vectors  of  objects  from  the  same  class 
(like  faces)  can  be  thought  to  define  an  object  subspace.  Assuming  that  we  have  enough  prototype  images  that 
encompass  major  variations  inside  the  class,  we  can  span  the  complete  object  subspace.  Thereafter,  by  virtue  of  our 
subspace  assumption,  we  can  express  any  novel  object  from  the  same  class  as  a  combination  of  the  prototype  vectors. 
We present experimental results to evaluate this strategy and, finally, explore the usefulness of the combination
parameters for analysis, recognition and low-dimensional object encoding.
1.  INTRODUCTION 

The optimal and reliable description of objects has been one of the primary goals of computer vision. A significant
amount of research has been conducted on deriving mathematical models of objects from images. Such descriptions
have been useful for purposes like object recognition and image analysis. Two important characteristics of such
descriptions are that they should be easily computable and unique.

A common strategy employed in computer vision to design effective algorithms is to emulate methods of reasoning
believed to be used by human beings to perform image analysis. Various psychophysical and physiological studies1-6
have indicated that the human visual system uses strategies that encode three-dimensional objects as multiple
viewpoint-specific representations that are largely two-dimensional with appropriate depth information.3,5
Experimental evidence and computational simulations indicate that view interpolation offers a plausible explanation
for the viewpoint-dependent response times and error rates of human recognition.1

The psychophysical studies cited above strongly motivate us to devise algorithms that represent object
classes in terms of 2D prototype images. An object can be fully described by its two components, namely shape and
appearance*. Hence, given a sufficient number of good prototypes that encompass appropriate in-class variations, our
goal is to build a deformable appearance model that reliably describes an object class by linear combination of
prototypes, i.e. the parameters of a novel object can be obtained as a linear combination of the parameters of the
prototypes in the training set.

Figure  1  gives  a  pictorial  description  of  the  approach  for  three  prototype  images.  The  images  at  the  vertices  of 
the  triangle  represent  three  prototypes.  The  prototypes  can  be  registered  with  each  other  by  warping  them  onto  the 
average  image.  The  shape  and  texture  of  each  prototype  can  then  be  combined  to  generate  new  images.  Here  the 
intermediate  images  are  represented  along  the  edges  of  the  triangle.  For  each  intermediate  image,  the  contribution 
of  the  adjacent  prototypes  is  more  than  that  of  the  one  farther  away. 

*The notion of appearance is the same as that of object texture. Both terms will be used interchangeably.
Figure  1.  Basic  Idea:  Combination  of  prototypes.  The  images  at  the  vertices  of  the  triangle  are  the  prototype 
images  which  can  be  registered  with  each  other  by  warping  onto  the  average  image.  The  prototype  shape  and  the 
texture  parameters  obtained  by  registration  can  be  combined  to  generate  new  images. 

2.  OBJECT  REPRESENTATION 

Based on the premise that objects can be represented by their shape and appearance, computer vision algorithms
for object representation can be categorized into two classes, namely shape models and appearance models. Early
techniques for shape and appearance modeling were built independently of each other, and each class of techniques
ignored the parameterization of the other feature. This section reviews some deformable shape and view-based
representations relevant to the formulation of our approach and then explores some methods that combine
both representations to handle greater variations robustly.

2.1.  Shape  Modeling 

Initial shape representation methods concentrated on ways to employ flexible models by constraining the solution
space of allowed deformations. Kass, Witkin and Terzopoulos7 described a method of representing objects in images
as active contours, or energy-minimizing splines, that are guided by external constraint forces and influenced by
image forces along the image gradients. Cootes proposed the Chord Length Distribution or the Point Distribution
Model,8 a method of shape representation that estimates the chord lengths where each object is represented as an
n-vertex polygon.

Sclaroff and Pentland9,10 proposed a method of representing objects in terms of modal descriptions, which
is based on the idea of describing objects by their generalized symmetries, as defined by the object's deformation
modes. Unlike the point distribution model, which statistically models object shapes, this method physically
models objects by determining the modes of free vibration of the object. The modes of an object define an
orthogonal object-centered coordinate system in which each feature point can be uniquely described as a combination
of those modes. Cootes and Taylor later proposed a new method that combines this physically based method with
their statistically based method.11

2.2.  Appearance  Modeling 

Appearance-based models seek to obtain a compact representation of the intensity distribution. One such set of
techniques employs eigen-based methods to compress an image by projecting it onto a low-dimensional orthogonal basis,
the eigenspace.12-15 This orthogonal basis is usually learned statistically by applying principal components analysis
(PCA) or the Karhunen-Loeve expansion to a large set of training data.
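
The eigenspace projection described above can be illustrated with a minimal NumPy sketch. It assumes images are stored as row vectors and uses the singular value decomposition to obtain the principal directions; the function names and the synthetic training data are hypothetical and are not taken from the cited methods.

```python
import numpy as np

def fit_eigenspace(images, k):
    """Learn a k-dimensional orthonormal basis from row-vectorized training images."""
    mean = images.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    """Coefficients of an image in the eigenspace."""
    return basis @ (image - mean)

def reconstruct(coeffs, mean, basis):
    """Approximate image recovered from its eigenspace coefficients."""
    return mean + basis.T @ coeffs

rng = np.random.default_rng(0)
# Hypothetical training set: 20 images of 64 pixels lying in a 3-D affine subspace.
latent = rng.normal(size=(20, 3))
mixing = rng.normal(size=(3, 64))
train = latent @ mixing + 5.0

mean, basis = fit_eigenspace(train, k=3)
coeffs = project(train[0], mean, basis)
approx = reconstruct(coeffs, mean, basis)
```

Because the synthetic data has only three degrees of freedom, three coefficients reproduce each 64-pixel image; real image sets would only be approximated.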

The concept of point distribution models was extended to model intensity distributions. These models are
known as Appearance models16 and have been claimed to address the problem of shape normalization, which was not
addressed in eigenfaces. This method requires labeled examples for training.


2.3.  Combined  Shape  and  Appearance  Modeling 

In order to avoid the implicit parameterization of shape in appearance models and to make shape models more
photorealistic, there has been a growing interest in modeling both shape and appearance in a single model. Nastar,
Moghaddam and Pentland17 combined physically based modes of vibration with statistically based modes of
variation by considering each point in the image as a triplet (x, y, I(x,y)) and doing manifold matching in this
XYI space. Although this method combines both the statistical and physical modes of variation, it is dependent on
good initialization.

Ullman and Basri18 have shown that an object can be represented as a combination of 2-D images, where the
images are represented in terms of some linear transformations in the 3-D space. However, this method assumes a
linear framework for object deformations and handles only limited non-rigid deformations.

Poggio, Jones and Vetter19-22 have suggested that, given a sufficient number of prototypes, the parameter vectors
define a linear space and span the model space. Any novel object can then be expressed as some combination of those
prototype vectors. This method combines shape or geometry with texture or appearance in a way that jointly optimizes
both shape and appearance parameters to fit the model. This is a robust method as the problem of model fitting is
solved as a global non-linear minimization problem.

Cootes, Edwards and Taylor23 have also suggested a combined formulation of their appearance and active shape
models to develop a new model known as the Active Appearance Model. This method applies PCA to the shape
and the texture spaces separately, then combines the results and applies PCA again to remove redundancies between
shape and texture parameters. All objects are then represented as some combination in this orthogonal model.

3.  MATHEMATICAL  FORMULATION 

Let $I_1, I_2, \ldots, I_N$ be the N prototypes available for training the system. Let $I_{ref}$ be the reference
image. The objective is to define a framework whereby all the prototype images can be combined to generate images
of novel objects from the same class. The formulation described here is similar in flavor to that developed by Jones
and Poggio,19 though the shape deformations are determined by finite element methods as opposed to optical flow
methods. The prototype images are initially not in correspondence and hence cannot be combined directly. This makes
it necessary to determine pixel-to-pixel correspondences amongst the prototype images. Let $S_1, S_2, \ldots, S_N$
be a set of shape parameters such that each $S_i$ can be used to warp the ith prototype image onto the reference
image, thereby bringing the prototype image into correspondence with the reference image, i.e.

$$S_i(x, y) = (\hat{x}, \hat{y}) \tag{1}$$

where $(\hat{x}, \hat{y})$ is the point in $I_i$ which corresponds to $(x, y)$ in $I_{ref}$. We define

$$T_i(x, y) = W^{-1}(I_i, S_i)(x, y) \tag{2}$$

where $W$ is the warping function. Thus, for each prototype $I_i$ in the training set, we obtain a shape vector
$S_i$ and an inverse warped texture vector $T_i$. Note that the texture vectors are shape-free, as all of the
prototype images are inverse warped onto the same reference image.

Given a large number of prototypes which appropriately vary from each other with respect to different
characteristics of the object class, we can define a set of parameters $b = [b_1, b_2, \ldots, b_N]$ and
$c = [c_1, c_2, \ldots, c_N]$ such that the shape and the texture of a novel object $I_{novel}$ (not in the
prototype set) can be derived as a combination of the prototype shape and texture parameters:

$$S_{novel} = \sum_{i=1}^{N} c_i S_i = c \cdot S \tag{3}$$

$$T_{novel} = \sum_{i=1}^{N} b_i T_i = b \cdot T \tag{4}$$

Therefore, the equation for the novel image can be defined as follows:

$$W^{-1}(I_{novel}, c \cdot S) = b \cdot T \tag{5}$$


Hence the matching phase reduces to finding the parameters that best match the novel image, which can be done by
minimizing the sum of squared differences (SSD) error

$$E(c, b) = \frac{1}{2} \sum_{x,y} \left[ W^{-1}(I_{novel}, c \cdot S)(x, y) - (b \cdot T)(x, y) \right]^2 \tag{6}$$
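
As an illustration of the linear combinations in Equations 3 and 4 and of the SSD error in Equation 6, the following NumPy sketch evaluates them on synthetic vectors. It assumes the novel image has already been inverse warped into the reference frame, so the warping function itself is not implemented; the array sizes and the names `combine` and `ssd_error` are hypothetical.

```python
import numpy as np

def combine(coeffs, vectors):
    """Linear combination sum_i coeffs[i] * vectors[i] (Eqs. 3 and 4)."""
    return np.asarray(coeffs) @ np.asarray(vectors)

def ssd_error(warped_novel, b, T):
    """Eq. 6, given the novel image already inverse warped with shape c . S."""
    residual = warped_novel - combine(b, T)
    return 0.5 * float(np.sum(residual ** 2))

rng = np.random.default_rng(1)
N, n_modes, n_pixels = 4, 6, 50
S = rng.normal(size=(N, n_modes))    # hypothetical prototype shape vectors
T = rng.normal(size=(N, n_pixels))   # hypothetical shape-free texture vectors

c = np.array([0.4, 0.3, 0.2, 0.1])   # shape coefficients
b = np.array([0.1, 0.2, 0.3, 0.4])   # texture coefficients

S_novel = combine(c, S)              # Eq. 3
T_novel = combine(b, T)              # Eq. 4

error_at_truth = ssd_error(T_novel, b, T)   # texture generated by b itself
error_elsewhere = ssd_error(T_novel, c, T)  # a different coefficient vector
```

The error vanishes when the texture coefficients are the ones that generated the warped image and is positive for other coefficients, which is what the matching phase exploits.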


The values of the parameters c and b so obtained provide a compact representation of the novel image in terms of
the prototypes in the training set. Since the shape and the texture vectors of the prototypes define two different
linear subspaces for the object class, which may or may not be independent of each other, an important caveat is
that the shape and texture parameters must be estimated jointly. Equation 6 is the basic equation
that describes the mathematical formulation of the system. Further constraints may be employed depending upon
the modeling of the parameters (see Section 5). We use a non-linear technique for minimization. In order to avoid
getting trapped in local minima, we use Gaussian pyramids. Both topics are described briefly here.

3.1.  Minimization 

For the minimization of the objective function, we use the Levenberg-Marquardt method,24 a non-linear optimization
technique. This technique blends two update rules during each iteration. Smooth switching between the two is
accomplished by a weighting term $\lambda$. When the magnitude of $\lambda$ is low, the update approaches the
Gauss-Newton step, which relies on a linearization of the residuals, whereas a higher magnitude of $\lambda$ forces
the update towards a short step of the gradient descent technique.

The mathematical formulation is as follows. Given an objective function $E$ with parameter vector $q$, the
goal is to determine an instance of $q$ that minimizes the value of $E$. This is achieved iteratively by solving the
following set of simultaneous equations:

$$(H + \lambda I)\,\Delta q = g \tag{7}$$

$$q' = q + \Delta q \tag{8}$$

where $H$, $g$ and $\lambda$ are the Hessian matrix, the gradient vector and the controlling parameter respectively.
The gradient vector and the Hessian matrix are determined as follows:

$$g_k = -\frac{\partial E}{\partial q_k} \tag{9}$$

$$h_{kl} = \frac{\partial E}{\partial q_k} \frac{\partial E}{\partial q_l} \tag{10}$$


The cost of the objective function is determined with the updated parameter values $q'$. If the cost has decreased as
compared to its previous value, then the system moves towards Gauss-Newton minimization by scaling down $\lambda$ by
a factor of 10. If the cost has increased, then the system moves towards gradient descent by scaling up $\lambda$ by
a factor of 10. In the former case, the parameters are updated to $q'$, whereas in the latter case, the updated
parameter vector $q'$ is discarded and we proceed with the old parameter vector $q$. Higher values of $\lambda$
restrict parameter displacement in the error space and force the solution to move along the steepest gradient.
Equations for computing the various derivatives mentioned here are provided in Section 5.
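
The accept/reject rule and the scaling of the damping term by a factor of 10 can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes a generic least-squares objective E(q) = 0.5 ||r(q)||^2 and builds the approximate Hessian as the standard Gauss-Newton product J^T J, which is a substitution for the approximation given in the text. The exponential curve-fitting problem is a hypothetical stand-in for image matching.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, q, lam=1e-3, max_iter=100, tol=1e-12):
    """Minimize E(q) = 0.5 * ||residual(q)||^2 with a damped, accept/reject update."""
    q = np.asarray(q, dtype=float)
    cost = 0.5 * np.sum(residual(q) ** 2)
    for _ in range(max_iter):
        r, J = residual(q), jacobian(q)
        g = -J.T @ r    # negative gradient of E, matching the sign convention of Eq. 9
        H = J.T @ J     # Gauss-Newton approximation of the Hessian (assumption)
        dq = np.linalg.solve(H + lam * np.eye(q.size), g)   # Eq. 7
        q_new = q + dq                                      # Eq. 8
        new_cost = 0.5 * np.sum(residual(q_new) ** 2)
        if new_cost < cost:
            # Cost decreased: accept the step and reduce the damping.
            q, cost, lam = q_new, new_cost, lam / 10.0
        else:
            # Cost increased: discard the step and increase the damping.
            lam *= 10.0
        if cost < tol:
            break
    return q, cost

# Hypothetical test problem: fit y = a * exp(k * t) to noise-free samples.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda q: q[0] * np.exp(q[1] * t) - y
jacobian = lambda q: np.stack([np.exp(q[1] * t), q[0] * t * np.exp(q[1] * t)], axis=1)

q_fit, final_cost = levenberg_marquardt(residual, jacobian, [1.0, 0.0])
```

In the matching problem, the parameter vector would hold the shape and texture coefficients and the residual would be the per-pixel difference of Equation 6.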


3.2.  Gaussian  Pyramids 

It is not uncommon to find situations where the minimization gets trapped in a local minimum. This may
happen when the error function is not convex or when the amount of change allowed in the parameters does not
move the current estimate closer to the global minimum. As a result, the solution gradually drifts into a local
trough and eventually gets trapped there. Such problems can be handled reliably by using a multigrid relaxation
approach.25 These methods work by taking advantage of multiple discretizations and smoothing of a continuous
problem over a range of resolution levels. The solution of a minimization problem requires computation proportional
to the spatial distance between the current estimate and the actual solution. This suggests the possibility of
speedup by computing the solution over a coarse grid and then enhancing it by successively refining the grid.
Pyramids are one such multi-resolution technique used in image processing.26

The pyramids used in our implementation are called octave pyramids, as at each level the image is halved in
each dimension by subsampling. Successive reduction in resolution results in the loss of high
frequency components of the original image. In other words, it is equivalent to passing the image through low-pass
filters, whereby the image is blurred by Gaussian kernels at each level. Thus, at the coarsest level, it may be
assumed that the components responsible for the local minima are sufficiently smoothed. Hence, when successive
solutions are computed from the coarsest level and propagated to the finer levels, the solution tends towards the
global minimum and may eventually be expected to converge to it.
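
A minimal sketch of an octave pyramid is given below. It assumes a separable 5-tap binomial kernel as the Gaussian approximation and edge replication at the image borders; both choices, and the function names, are illustrative assumptions rather than details taken from the implementation described here.

```python
import numpy as np

# Binomial weights, a common small approximation to a Gaussian kernel (assumption).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(image):
    """Separable low-pass filter with edge replication at the borders."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="edge")
    # Filter along the columns first, then along the rows.
    rows = sum(wt * padded[:, i:i + w] for i, wt in enumerate(KERNEL))
    return sum(wt * rows[i:i + h, :] for i, wt in enumerate(KERNEL))

def octave_pyramid(image, levels):
    """Return [finest, ..., coarsest]; each level halves both dimensions."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2, ::2])
    return pyramid

pyramid = octave_pyramid(np.ones((128, 128)), levels=3)
shapes = [level.shape for level in pyramid]
```

A coarse-to-fine matcher would run the minimization on the last (coarsest) entry first and use the resulting parameters as the initial estimate at the next finer level.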


4.  DEFORMABLE  SHAPE  MODELING 

Shape modeling in our system is done using finite element models (FEM).10,27 The advantage of finite element
models is their ability to enforce a priori constraints on smoothness and amount of deformation, which in general is
not possible in statistically based or optical flow based methods. FEM is a numerical approach for modal analysis
which can be used for describing non-rigid deformations of an elastic body. In this formulation, an object is modeled
as a sheet of rubber which can freely deform. The surface of the object is interpolated by the Galerkin method.28 A
set of polynomial functions is defined that relates the displacement of a single point to the relative displacements
of other points. Hence all the points can be expressed in terms of the interpolation functions as below:

u(x)  =  H(x)U  (11) 

where  H  is  the  set  of  interpolation  functions,  x  is  a  vector  of  all  the  data  points  and  U  is  the  vector  of  displacement 
components  at  each  feature  point.  The  strains  produced  at  each  feature  point  due  to  the  displacement  are  obtained 
as  a  combination  of  the  element  strains  associated  with  the  feature  points: 

e(x)  =  B(x)U  (12) 

where  B  is  the  strain  matrix  and  e  is  a  vector  of  strains  produced  at  the  point  under  consideration.  The  problem  of 
modal  displacements  is  then  solved  as  a  dynamic  equilibrium  equation: 

$$M\ddot{U} + D\dot{U} + KU = R \tag{13}$$

where $M$, $D$ and $K$ are the mass, damping and stiffness matrices and $R$ is the load matrix. The reader is
directed to Ref. 10 for a detailed derivation of all the mentioned matrices. The non-rigid deformations are then
expressed in an orthogonal system where the basis is defined as the set of orthonormalized eigenvectors of
$M^{-1}K$. Given that $x$ is the set of all feature points, the locations of the new feature points are given as
follows:

$$x' = \bar{x} + \sum_{j=1}^{m} \phi_j u_j \tag{14}$$

where $\bar{x}$ is the mean displacement position, $x'$ is the deformed position, $u_j$ is the jth mode parameter
value and $\phi_j$ is the jth eigenvector defining the jth modal displacement. The system can be re-orthogonalized
to separate the affine parameters from the modal parameters.
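
The modal basis and the deformation of Equation 14 can be illustrated with a small NumPy sketch. It assumes a lumped (diagonal) mass matrix so that the eigenvectors of M^{-1}K can be obtained from a symmetric eigenproblem, and it uses a four-node spring chain as a hypothetical stand-in for the finite element matrices derived in Ref. 10.

```python
import numpy as np

def deformation_modes(M_diag, K):
    """Eigen-decomposition of M^{-1} K for a lumped (diagonal) mass matrix.

    Returns eigenvalues in ascending order and eigenvectors (as columns)
    that are orthonormal with respect to M.
    """
    inv_sqrt = 1.0 / np.sqrt(M_diag)
    # M^{-1/2} K M^{-1/2} is symmetric and has the same eigenvalues as M^{-1} K.
    A = inv_sqrt[:, None] * K * inv_sqrt[None, :]
    eigvals, V = np.linalg.eigh(A)
    return eigvals, inv_sqrt[:, None] * V

def deform(x_mean, phi, u):
    """Eq. 14: x' = x_mean + sum_j phi_j * u_j."""
    return x_mean + phi @ u

# Hypothetical 1-D chain of four nodes joined by unit springs, with free ends.
K = np.array([[ 1.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  1.0]])
M_diag = np.array([1.0, 2.0, 2.0, 1.0])

eigvals, phi = deformation_modes(M_diag, K)
x_mean = np.array([0.0, 1.0, 2.0, 3.0])
m = 2                                    # keep only the two lowest-frequency modes
x_new = deform(x_mean, phi[:, :m], np.array([0.5, -0.2]))
```

The lowest mode of this chain has zero eigenvalue and corresponds to a rigid translation, which is analogous to the rigid-body terms that the text separates from the non-rigid modal parameters.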

In our formulation, we use an FEM based technique called active blobs29 as a tool to register prototype images.
An initial blob for a reference object, $I_{ref}$, is created by associating a deformable polygonal mesh with the
object texture map. Registration of a novel image, $I_i$, is then solved as an energy minimization problem where the
shape parameters (in our case, the finite element modes) are estimated so that the difference between the warped
reference object and the novel object is minimized by a least squares approach. The energy minimization problem is
formulated as follows:
where $I_i(x_i, y_i)$ is the intensity of the pixel at location $(x_i, y_i)$ in the inverse warped target image
$I_i$ and $I_{ref}(x_i, y_i)$ is the intensity of the pixel at the same location in the reference image. The adverse
effect of outliers, which tend to throw the minimization process off track, is handled by using a robust error norm,
the Lorentzian influence function $\rho$, given as:

$$\rho(e_i, \sigma) = \log\left(1 + \frac{e_i^2}{2\sigma^2}\right) \tag{17}$$

where $\sigma$ is an optional scale parameter.
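
To show why the Lorentzian norm limits the effect of outliers, the short sketch below evaluates Equation 17 and its derivative with respect to the residual. The derivative is computed analytically from Equation 17; the function names and the sample residuals are illustrative only.

```python
import math

def lorentzian(e, sigma):
    """rho(e, sigma) = log(1 + e^2 / (2 sigma^2)), Eq. 17."""
    return math.log(1.0 + (e * e) / (2.0 * sigma * sigma))

def lorentzian_influence(e, sigma):
    """Derivative of rho with respect to e; it decays for large |e|."""
    return 2.0 * e / (2.0 * sigma * sigma + e * e)

small, large = 1.0, 100.0
penalty_small = lorentzian(small, 1.0)
penalty_large = lorentzian(large, 1.0)
quadratic_large = 0.5 * large ** 2   # what a plain squared error would charge
```

A residual of 100 contributes far less under the Lorentzian than under a squared error, and its influence on the gradient is smaller than that of a residual of 1, so a few badly matched pixels do not dominate the fit.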

5.  IMPLEMENTATION  STEPS 

We have divided the implementation into three stages, namely average image computation, training phase and
matching phase. We use the Levenberg-Marquardt method24 for minimization during the matching phase. Below, we
provide a brief description of each step. The reader is directed to Ref. 30 for a detailed description of the algorithm.

5.1.  Average  Image  Computation 

It may be the case that some prototypes are more similar to each other and hence form clusters in the prototype
subspace. If the reference image happens to be selected from one such cluster, then it may not register well with
prototypes from other clusters. This phenomenon is called true shape vulnerability.31 In order to avoid this problem
we use the average image, which will be fairly equidistant from all prototypes, for registration. This average image
is computed in an iterative fashion. We start with an arbitrary reference image $I_{ref}$. The user circles the
region of interest, from which a blob is created. This blob is then registered with the remaining prototypes. A new
reference blob is created by averaging the shape and the texture parameters of the prototypes obtained by the process
of registration. This process is repeated to obtain new reference blobs until the difference between the new
reference blob and the old reference blob drops below a threshold.

5.2.  Training  Phase 

Once  the  reference  image  has  been  computed,  the  system  is  trained  by  registering  all  the  prototypes  with  the  reference 
blob.  In  our  implementation,  the  mode  values  required  for  deforming  the  reference  blob  to  match  the  prototype  are 
stored  as  the  shape  vectors  and  the  inverse  warped  prototype  images  are  stored  as  the  texture  vectors. 

5.3.  Matching  Phase 

Matching of a novel image is done by minimizing the objective function of Equation (6). The minimization is
performed by the Levenberg-Marquardt method. The first and second derivatives of the objective function required for
the minimization process are computed according to the first approximation principle for derivatives. For simplicity,
we use forward warping instead of inverse warping. We employ a further constraint on the shape coefficients such
that they sum to 1, in order to address the redundancy due to the modeling of the affine parameters in the FEM model.
The equations for the objective function along with its required derivatives are provided below:

$$E(c, b) = \frac{1}{2} \sum_{x,y} \left[ I_{novel}(x, y) - W(b \cdot T, c \cdot S)(x, y) \right]^2 + \gamma \left( \sum_{k=1}^{N} c_k - 1 \right)^2 \tag{19}$$

The second derivatives of the given function are approximated as below:

$$\frac{\partial^2 E}{\partial m_i \partial m_j} = \frac{\partial E}{\partial m_i} \frac{\partial E}{\partial m_j} \tag{24}$$

where $m_k = c_k$ or $b_k$. These derivatives are then substituted into Equations 9 and 10 to compute the gradient
vector and the approximate Hessian matrix required in the Levenberg-Marquardt method. The parameter vector $q$
is defined as a composite vector of the shape and the appearance parameters:

$$q = [c \,|\, b] \tag{25}$$

$q$ is updated by the change in the parameter vector, $\Delta q$, estimated by Equation 7. This process is iterated
until the final error magnitude drops below a given threshold or a fixed number of iterations is completed. The
$\lambda$ in Equation 7 acts as a time-varying control parameter that forces the solution to follow the steepest
gradient in order to converge to the minimum.
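
The composite parameter vector and the sum-to-one penalty on the shape coefficients can be sketched as below. The helper names are hypothetical, and the gradient shown is simply the analytic derivative of the penalty term in Equation 19 with respect to each shape coefficient; the image-dependent part of the objective is not included.

```python
import numpy as np

def pack(c, b):
    """Composite parameter vector q = [c | b] (Eq. 25)."""
    return np.concatenate([c, b])

def unpack(q, n_prototypes):
    """Split q back into shape coefficients c and texture coefficients b."""
    return q[:n_prototypes], q[n_prototypes:]

def sum_to_one_penalty(c, gamma):
    """Value of gamma * (sum_k c_k - 1)^2 and its gradient with respect to c."""
    excess = np.sum(c) - 1.0
    return gamma * excess ** 2, 2.0 * gamma * excess * np.ones_like(c, dtype=float)

c = np.array([0.5, 0.5, 1.0])
b = np.array([0.2, 0.3, 0.5])
q = pack(c, b)
c_back, b_back = unpack(q, n_prototypes=3)
value, grad = sum_to_one_penalty(c, gamma=3.0)
```

Here the shape coefficients sum to 2, so the penalty is 3 and every shape coefficient receives the same gradient of 6; the texture coefficients are unaffected by this term.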


6.  RESULTS 

The system was tested with two types of datasets, namely face images and sequences of heart images, using novel
images that were not present in the training set. The main points for which we tested the system are the following:

• the ability of the algorithm to reconstruct novel images by appropriately combining prototype images.

• the capability of the algorithm to handle significant variations (e.g. “gender”, in our experiments for the
face images).

• the robustness of the linear combinations paradigm in reconstructing novel images that belong to the same object
class but have not been seen in the training set (e.g. the algorithm can reconstruct images of men without
mustaches by appropriately combining images of women and men with mustaches).

• the generalization of the technique to objects other than faces (e.g. heart images, in our experiments).

6.1.  Test  Set  1:  Face  Images 

The code for registration of images was taken from Active Blobs, which is available on the internet*. The prototype
set comprises random face images drawn from the MIT database† (see Figure 2). Several novel face images, which
were not present in the training set, were tested. All of those could be reconstructed in the combination of
parameters paradigm described earlier. The images are of dimension 128x128 pixels. The size of the faces inside the
images was typically around 64x64 pixels. The implementation makes extensive use of the graphics hardware for texture
mapping and bilinear interpolation. Currently the reconstruction of a novel face image takes around 8 minutes
on an R5000 SGI O2, 180 MHz machine. The majority of the time is spent in combining the prototype images at each
iteration for the reconstruction of the novel image. The texture vectors for the prototype images comprise the whole
texture. Significant speedup is possible by dimensionality reduction. In future, we intend to evaluate the system
with dimensionally reduced prototype texture vectors, where we will use coefficients obtained by projecting prototype
images into the eigenspace instead of textures. We expect the performance of the system to improve, as we will
have to combine fewer eigen-images for reconstruction. We can further reduce the computation time by
doing minimization on only one color channel. The average image for the dataset is given in Figure 3(a) and some
results of matching of novel images are provided in Figures 3(b), (c), (d), (e) and (f).


* http://www.cs.bu.edu/groups/ivc/

† ftp://whitechapel.media.mit.edu/pub/images/


Figure 2. Prototype face images (#prototypes = 100). This training set comprises 75 images of males with
mustaches and 25 images of females.


Figure  3.  Linear  combination  of  face  images:  (a)  Average  face  image;  (b),  (c),  (d),  (e),  (f)  Reconstructed  novel 
face  images  obtained  by  the  combination  of  shape  and  texture  parameters  of  the  prototype  face  images  (Left:  input 
novel  image;  Middle:  average  image  registration;  Right:  reconstruction  of  the  circled  region). 

6.2.  Test  Set  2:  Sequences  of  Heart  Images 

We tested the system on images of the heart taken from the MIT heart database§, in order to evaluate the generality
of the approach. Since there were only 38 images, we included all the odd numbered images in the training set and
used the even numbered images as novel images. The images used for training are given in Figure 4(a). The average
image for this sequence of images and reconstructions of some novel images are given in Figures 4(b) and 4(c), (d),
(e) and (f). As may be seen, the approach was able to reliably reconstruct various intermediate stages of heart
pumping. The estimated shape and texture parameters, obtained from the reconstruction, can be used for various
medical applications.


7.  DISCUSSION 

The Levenberg-Marquardt method is a quadratic minimization technique that requires a significant amount of time for
the computation of the Hessian matrix at each step. The major bottleneck is the number of floating point
multiplications involved, which is $O(n^2 m^2)$ where each image has $O(n^2)$ pixels and there are $O(m)$ prototype
images. This issue needs to be addressed in order to make the matching process real-time. Currently we are exploring
different heuristics for speedup. These heuristics are primarily focused on reducing the net computations and the
actual number of prototypes to combine.

A possible approach to significantly reduce the computation time at each step is to compute the Hessian matrix
over random pixels, rather than over complete images. This, intuitively, simulates the stochastic gradient method,24
but would be more efficient as it would converge to the solution faster by taking larger step sizes (implicitly
controlled by $\lambda$) at each iteration, provided the error function happens to be quadratic.
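
The idea of accumulating the Hessian over a random subset of pixels can be sketched as follows. This is only an illustration under stated assumptions: the per-pixel Jacobian `J` and residual vector `r` are taken as given, the approximate Hessian is the Gauss-Newton product J^T J rather than the approximation used in the text, and the rescaling by the sampling ratio is a design choice of this sketch.

```python
import numpy as np

def gauss_newton_terms(J, r, n_samples=None, rng=None):
    """Accumulate J^T J and -J^T r over all pixels or over a random subset.

    J has one row per pixel and one column per parameter; r holds per-pixel residuals.
    """
    n_pixels = J.shape[0]
    if n_samples is None or n_samples >= n_pixels:
        idx = np.arange(n_pixels)
    else:
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.choice(n_pixels, size=n_samples, replace=False)
    scale = n_pixels / idx.size   # keeps the estimate on the scale of the full sums
    Js, rs = J[idx], r[idx]
    return scale * (Js.T @ Js), -scale * (Js.T @ rs)

rng = np.random.default_rng(0)
J = rng.normal(size=(1000, 4))   # hypothetical Jacobian: 1000 pixels, 4 parameters
r = rng.normal(size=1000)        # hypothetical residuals

H_full, g_full = gauss_newton_terms(J, r)
H_sub, g_sub = gauss_newton_terms(J, r, n_samples=200, rng=rng)
```

With 200 of 1000 pixels, the cost of forming the matrix drops by roughly a factor of five, at the price of a noisier estimate of both the matrix and the gradient.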

The number of computations involved for minimization is quadratically related to the number of prototypes.
Hence, selection of an optimal set of prototypes is paramount to reducing the computation time. Currently, a set
of prototypes is chosen randomly and the training is done on this set. If three prototypes are thought to lie on a
line in the prototype space, then any number of extra prototypes on the same line are redundant and hence should be
identified and removed. Though different statistical methods like k-means clustering, hierarchical clustering, Bayes
classifier etc. can be used, a normal tradeoff involved is that typical pattern recognition methods require large
training data sets which are diverse enough to characterize the whole object class.27,32 We are currently exploring
different techniques to be able to select sets of “good” prototypes in future.


§ ftp://whitechapel.media.mit.edu/pub/images/


Figure 4. Linear combination of heart images: (a) Prototype heart images (#prototypes = 20. This training set
comprises various intermediate images of contraction and expansion of the human heart while pumping blood);
(b) Average heart image; (c), (d), (e), (f) Reconstructed novel heart images obtained by combining the shape and
texture parameters of the prototype heart images (Left: input novel image; Middle: average image registration;
Right: reconstruction of the circled region).

8.  CONCLUSION 

We presented a model-based linear combinations approach for modeling objects. The methodology, implementation
status, results obtained so far and possible explanations of various observed behaviors have been described. In
addition, the method was compared with the existing active appearance model and the pros and cons were brought
out. Various ways of extending the existing framework have also been described. In future, we plan to implement
clustering algorithms for appropriately choosing the representative set of prototypes. We also intend to
study the application of the given approach to various computer vision problems, viz. recognition, image registration
and analysis, image compression and morphing.